Brief Glamor Hacks
Eric Anholt started writing Glamor a few years ago. The goal was to
provide credible 2D acceleration based solely on the OpenGL API, in
particular, to implement the X drawing primitives, both core and
Render extension, without any GPU-specific code. When he started,
the thinking was that fixed-function devices were still relevant, so
that original code didn't insist upon modern OpenGL features like
GLSL shaders. That made the code less efficient and hard to write.
Glamor used to be a side-project within the X world, seen as something
that really wasn't very useful; something that any credible 2D driver
would replace with custom highly-optimized GPU-specific code. Eric and
I both hoped that Glamor would turn into something credible and that
we'd be able to eliminate all of the horror-show GPU-specific code in
every driver for drawing X text, rectangles and composited
images. That hadn't happened though, until now.
Fast forward to the last six months. Eric has spent a bunch of time
cleaning up Glamor internals, and in fact he's had it merged into the
core X server for version 1.16, which will be coming up this July.
Within the Glamor code base, he's been cleaning some internal
structures up and making life more tolerable for Glamor developers.
Using libepoxy
A big part of the cleanup was a transition of all of the extension
function calls to use his other new project, libepoxy, which provides
a sane, consistent and performant API to OpenGL extensions for Linux,
Mac OS and Windows. That library is awesome, and you should use it for
everything you do with OpenGL because not using it is like pounding
nails into your head. Or eating non-tasty pie.
Using VBOs in Glamor
One thing he recently cleaned up was how to deal with VBOs (vertex
buffer objects) during X operations. VBOs are absolutely essential to
modern OpenGL applications; they're really the only way to efficiently
pass vertex data from the application to the GPU. And, every other
mechanism is deprecated by the ARB as not a part of the blessed "core
context".
Glamor provides a simple way of getting some VBO space, dumping data
into it, and then using it through two wrapping calls which you use
along with glVertexAttribPointer as follows:
    pointer = glamor_get_vbo_space(screen, size, &offset);
    glVertexAttribPointer(attribute_location, count, type,
                          GL_FALSE, stride, offset);
    memcpy(pointer, data, size);
    glamor_put_vbo_space(screen);
glamor_get_vbo_space allocates the specified amount of VBO space
and returns a pointer to that along with an offset, which is
suitable to pass to glVertexAttribPointer. You dump your data into the
returned pointer, call glamor_put_vbo_space and you're all done.
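To make the get/put contract concrete, here is a minimal CPU-only sketch of it. It is not Glamor's implementation: the `sketch_` names, the fixed-size buffer and the simple bump allocation are all assumptions made for illustration, and a plain array stands in for the mapped VBO.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a mapped VBO: a fixed CPU-side buffer. */
#define SKETCH_VBO_SIZE 4096

struct sketch_vbo {
    unsigned char storage[SKETCH_VBO_SIZE];
    size_t used;   /* bytes handed out so far */
    int mapped;    /* nonzero between get and put */
};

/* Reserve `size` bytes. Returns a CPU pointer to write through and stores
 * the byte offset of the reservation in *offset; NULL if it does not fit. */
static void *sketch_get_vbo_space(struct sketch_vbo *vbo, size_t size,
                                  size_t *offset)
{
    assert(!vbo->mapped);
    if (size > SKETCH_VBO_SIZE - vbo->used)
        return NULL;
    *offset = vbo->used;
    vbo->used += size;
    vbo->mapped = 1;
    return vbo->storage + *offset;
}

/* End the write window opened by sketch_get_vbo_space. */
static void sketch_put_vbo_space(struct sketch_vbo *vbo)
{
    assert(vbo->mapped);
    vbo->mapped = 0;
}

/* Mirrors the sequence in the text: get space, copy vertex data, put. */
static int sketch_upload(struct sketch_vbo *vbo, const void *data,
                         size_t size, size_t *offset)
{
    void *pointer = sketch_get_vbo_space(vbo, size, offset);
    if (pointer == NULL)
        return 0;
    /* A real caller would hand *offset to glVertexAttribPointer here. */
    memcpy(pointer, data, size);
    sketch_put_vbo_space(vbo);
    return 1;
}
```

The point of returning both values is that the pointer is only meaningful to the CPU while the offset is what the GL sees, so the caller never has to know where the buffer actually lives.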
Actually Drawing Stuff
At the same time, Eric has been optimizing some of the existing
rendering code. But, all of it is still frankly terrible. Our dream of
credible 2D graphics through OpenGL just wasn't being realized at all.
On Monday, I decided that I should go play in Glamor for a few days,
both to hack up some simple rendering paths and to familiarize myself
with the insides of Glamor as I'm getting asked to review piles of
patches for it, and not understanding a code base is a great way to
help introduce serious bugs during review.
I started with the core text operations. Not because they're
particularly relevant these days as most applications draw text with
the Render extension to provide better looking results, but instead
because they're often one of the hardest things to do efficiently with
a heavyweight GPU interface, and OpenGL can be amazingly heavyweight
if you let it.
Eric spent a bunch of time optimizing the existing text code to try
and make it faster, but at the bottom, it actually draws each lit
pixel as a tiny GL_POINT object by sending a separate x/y vertex value
to the GPU (using the above VBO interface). This code walks the array
of bits in the font, checking each one to see if it is lit, then
checking if the lit pixel is within the clip region and only then
adding the coordinates of the lit pixel to the VBO. The amazing thing
is that even with all of this CPU and GPU work, the venerable 6x13
font is drawn at an astonishing 3.2 million glyphs per second. Of
course, pure software draws text at 9.3 million glyphs per second.
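The per-pixel loop described above looks roughly like the following sketch. This is not the Glamor source: the function name, the clip box type and the bit order (least significant bit is the leftmost pixel) are assumptions, and the output array stands in for the VBO.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical clip rectangle; x2 and y2 are exclusive. */
struct sketch_box { int x1, y1, x2, y2; };

/* Append one (x, y) vertex for every lit glyph pixel that survives the
 * clip test. Returns the number of vertices written to `out`. */
static int sketch_glyph_to_points(const uint8_t *bits, int width, int height,
                                  int stride, int dst_x, int dst_y,
                                  const struct sketch_box *clip,
                                  int16_t *out, int max_points)
{
    int n = 0;

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* Assumed bit order: bit 0 of each byte is the leftmost pixel. */
            if (!(bits[y * stride + (x >> 3)] & (1 << (x & 7))))
                continue;

            int px = dst_x + x;
            int py = dst_y + y;
            if (px < clip->x1 || px >= clip->x2 ||
                py < clip->y1 || py >= clip->y2)
                continue;

            assert(n < max_points);
            out[2 * n] = (int16_t) px;
            out[2 * n + 1] = (int16_t) py;
            n++;
        }
    }
    return n;
}
```

Every lit pixel costs a bit test, a clip test and a vertex write on the CPU before the GPU sees anything, which is why this path struggles to keep up with plain software rendering.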
I suspected that a more efficient implementation might be able to draw
text a bit faster, so I decided to just start from scratch with a new
GL-based core X text drawing function. The plan was pretty simple:
- Dump all glyphs in the font into a texture. Store them in 1bpp
format to minimize memory consumption.
- Place raw (integer) glyph coordinates into the VBO. Place
four coordinates for each and draw a GL_QUAD for each glyph.
- Transform the glyph coordinates into the usual GL range (-1..1)
in the vertex shader.
- Fetch a suitable byte from the glyph texture, extract a single bit
and then either draw a solid color or discard the fragment.
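The last three steps of the plan can be emulated on the CPU to show the arithmetic involved. This is a sketch in C rather than the actual GLSL: the vertex layout, the function names and the bit order within each byte are assumptions, and any y flip between X and GL conventions is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vertex data: integer destination and glyph-texture
 * source coordinates, written to the VBO untransformed. */
struct sketch_vertex { int16_t dst_x, dst_y, src_x, src_y; };

/* CPU side: emit the four corners of one glyph. */
static void sketch_glyph_quad(struct sketch_vertex v[4],
                              int dst_x, int dst_y, int src_x, int src_y,
                              int width, int height)
{
    /* Corner order walks around the rectangle, as GL_QUADS expects. */
    static const int corner[4][2] = { {0, 0}, {1, 0}, {1, 1}, {0, 1} };

    for (int i = 0; i < 4; i++) {
        v[i].dst_x = (int16_t) (dst_x + corner[i][0] * width);
        v[i].dst_y = (int16_t) (dst_y + corner[i][1] * height);
        v[i].src_x = (int16_t) (src_x + corner[i][0] * width);
        v[i].src_y = (int16_t) (src_y + corner[i][1] * height);
    }
}

/* Vertex shader math: map a pixel coordinate in 0..size to -1..1. */
static float sketch_to_ndc(int coord, int size)
{
    assert(size > 0);
    return coord * 2.0f / size - 1.0f;
}

/* Fragment shader math: fetch the byte holding texel (x, y) from the
 * 1bpp glyph data and extract its bit. 0 means discard the fragment.
 * Assumed bit order: bit 0 of each byte is the leftmost pixel. */
static int sketch_texel_lit(const uint8_t *atlas, int stride, int x, int y)
{
    assert(x >= 0 && y >= 0);
    return (atlas[y * stride + (x >> 3)] >> (x & 7)) & 1;
}
```

The CPU work per glyph is now a handful of integer additions, independent of how many pixels the glyph covers, which is where the larger fonts gain the most.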
This plan makes the X server code surprisingly simple; it computes
integer coordinates for the glyph destination and glyph image source
and writes those to the VBO. When all of the glyphs are in the VBO, it
just calls glDrawArrays(GL_QUADS, 0, 4 * count). The results were
encouraging:
1: fb-text.perf
2: glamor-text.perf
3: keith-text.perf

           1              2                       3            Operation
------------   ---------------------   ---------------------   -------------------------
   9300000.0      3160000.0 ( 0.340)     18000000.0 ( 1.935)   Char in 80-char line (6x13)
   8700000.0      2850000.0 ( 0.328)     16500000.0 ( 1.897)   Char in 70-char line (8x13)
   6560000.0      2380000.0 ( 0.363)     11900000.0 ( 1.814)   Char in 60-char line (9x15)
   2150000.0       700000.0 ( 0.326)      7710000.0 ( 3.586)   Char16 in 40-char line (k14)
    894000.0       283000.0 ( 0.317)      4500000.0 ( 5.034)   Char16 in 23-char line (k24)
   9170000.0      4400000.0 ( 0.480)     17300000.0 ( 1.887)   Char in 80-char line (TR 10)
   3080000.0      1090000.0 ( 0.354)      7810000.0 ( 2.536)   Char in 30-char line (TR 24)
   6690000.0      2640000.0 ( 0.395)      5180000.0 ( 0.774)   Char in 20/40/20 line (6x13, TR 10)
   1160000.0       351000.0 ( 0.303)      2080000.0 ( 1.793)   Char16 in 7/14/7 line (k14, k24)
   8310000.0      2880000.0 ( 0.347)     15600000.0 ( 1.877)   Char in 80-char image line (6x13)
   7510000.0      2550000.0 ( 0.340)     12900000.0 ( 1.718)   Char in 70-char image line (8x13)
   5650000.0      2090000.0 ( 0.370)     11400000.0 ( 2.018)   Char in 60-char image line (9x15)
   2000000.0       670000.0 ( 0.335)      7780000.0 ( 3.890)   Char16 in 40-char image line (k14)
    823000.0       270000.0 ( 0.328)      4470000.0 ( 5.431)   Char16 in 23-char image line (k24)
   8500000.0      3710000.0 ( 0.436)      8250000.0 ( 0.971)   Char in 80-char image line (TR 10)
   2620000.0       983000.0 ( 0.375)      3650000.0 ( 1.393)   Char in 30-char image line (TR 24)
This is our old friend x11perfcomp, but slightly adjusted for a modern
reality where you really do end up drawing billions of objects (hence
the wider columns). This table lists the performance for drawing a
range of different fonts in both poly text and image text
variants. The first column is for Xephyr using software (fb)
rendering, the second is for the existing Glamor GL_POINT based code
and the third is the latest GL_QUAD based code.
As you can see, drawing points for every lit pixel in a glyph is
surprisingly fast, but only about 1/3 the speed of software for
essentially any size glyph. By minimizing the use of the CPU and
pushing piles of work into the GPU, we manage to increase the speed of
most of the operations, with larger glyphs improving significantly
more than smaller glyphs.
Now, you ask how much code this involved. And, I can truthfully say
that it was a very small amount to write:
Makefile.am          2
glamor.c             5
glamor_core.c        8
glamor_font.c      181 ++++++++++++++++++++
glamor_font.h       50 +++++
glamor_priv.h       26 ++
glamor_text.c      472 +++++++++++++++++++++++++++++++++++++++++++++++++++++
glamor_transform.c   2
8 files changed, 741 insertions(+), 5 deletions(-)
Let's Start At The Very Beginning
The results of optimizing text encouraged me to start at the top of
x11perf and see what progress I could make. In particular, looking at
the current Glamor code, I noticed that it did all of the vertex
transformation with the CPU. That makes no sense at all for any GPU
built in the last decade; they've got massive numbers of transistors
dedicated to performing precisely this kind of operation. So, I
decided to see what I could do with PolyPoint.
PolyPoint is absolutely brutal on any GPU; you have to pass it two
coordinates for each pixel, and so the very best you can do is send it
32 bits, or precisely the same amount of data needed to actually draw
a pixel on the frame buffer. With this in mind, one expects that about
the best you can do compared with software is tie. Of course, the CPU
version is actually computing an address and clipping, but those are
all easily buried in the cost of actually storing a pixel.
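The 32-bit figure comes from two 16-bit coordinates per point. The sketch below shows that layout next to what the software path does for each point, assuming a 32 bits per pixel frame buffer; the type and function names are made up for illustration and are not taken from the fb code.

```c
#include <assert.h>
#include <stdint.h>

/* One point as sent to the GPU: two 16-bit coordinates, 32 bits total. */
struct sketch_point { int16_t x, y; };

/* Software path for one point: clip against the frame buffer bounds,
 * compute the pixel address, store one 32-bit pixel.
 * Returns 1 if the pixel was written, 0 if it was clipped. */
static int sketch_fb_point(uint32_t *fb, int width, int height,
                           struct sketch_point p, uint32_t pixel)
{
    assert(width > 0 && height > 0);
    if (p.x < 0 || p.x >= width || p.y < 0 || p.y >= height)
        return 0;
    fb[p.y * width + p.x] = pixel;
    return 1;
}
```

Both paths move four bytes per point, the GPU path as vertex data and the CPU path as the pixel store, so neither has an obvious bandwidth advantage.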
In any case, the results of this little exercise are pretty close to a
tie: the CPU draws 190,000,000 dots per second and the GPU draws
189,000,000 dots per second. Looking at the vertex and fragment
shaders generated by the compiler, it's clear that there's room for
improvement.
The fragment shader is simply pulling the constant pixel color from a
uniform and assigning it to the fragment color in this, the simplest
of all possible shaders:
    uniform vec4 color;
    void main()
    {
        gl_FragColor = color;
    }
This generates five instructions:
Native code for point fragment shader 7 (SIMD8 dispatch):
START B0
FB write target 0
0x00000000: mov(8) g113<1>F g2<0,1,0>F align1 WE_normal 1Q ;
0x00000010: mov(8) g114<1>F g2.1<0,1,0>F align1 WE_normal 1Q ;
0x00000020: mov(8) g115<1>F g2.2<0,1,0>F align1 WE_normal 1Q ;
0x00000030: mov(8) g116<1>F g2.3<0,1,0>F align1 WE_normal 1Q ;
0x00000040: sendc(8) null g113<8,8,1>F
render ( RT write, 0, 4, 12) mlen 4 rlen 0 align1 WE_normal 1Q EOT ;
END B0
As this pattern is actually pretty common, it turns out there's a
single instruction that can replace all four of the moves. That should
actually make a significant difference in the run time of this shader,
and this shader runs once for every single pixel.
The vertex shader has some similar optimization opportunities, but it
only runs once for every 8 pixels; with the SIMD format flipped
around, the vertex shader can compute 8 vertices in parallel, so it
ends up executing 8 times less often. It's got some redundant moves,
which could be optimized by improving the copy propagation analysis
code in the compiler.
Of course, improving the compiler to make these cases run faster will
probably make a lot of other applications run faster too, so it's
probably worth doing at some point.
Again, the amount of code necessary to add this path was tiny:
Makefile.am          1
glamor.c             2
glamor_polyops.c   116 ++++++++++++++++++++++++++++++++++++++++++++++++++--
glamor_priv.h        8 +++
glamor_transform.c 118 +++++++++++++++++++++++++++++++++++++++++++++++++++++
glamor_transform.h  51 ++++++++++++++++++++++
6 files changed, 292 insertions(+), 4 deletions(-)
Discussion of Results
These two cases, text and points, are probably the hardest operations
to accelerate with a GPU and yet a small amount of OpenGL code was
able to meet or beat software easily. The advantage of this work over
traditional GPU 2D acceleration implementations should be pretty
clear: this same code should work well on any GPU which offers a
reasonable OpenGL implementation. That means everyone shares the
benefits of this code, and everyone can contribute to making 2D
operations faster.
All of these measurements were actually done using Xephyr, which
offers a testing environment unlike any I've ever had: build
and test hardware acceleration code within a nested X server,
debugging it in a windowed environment on a single machine. Here's how
I'm running it:
$ ./Xephyr -glamor :1 -schedMax 2000 -screen 1024x768 -retro
The one bit of magic here is the -schedMax 2000 flag, which causes
Xephyr to update the screen less often when applications are very busy
and serves to reduce the overhead of screen updates while running
x11perf.
Future Work
Having managed to accelerate 17 of the 392 operations in x11perf, it's
pretty clear that I could spend a bunch of time just stepping through
each of the remaining ones and working on them. Before doing that, we
want to try and work out some general principles about how to handle
core X fill styles. Moving all of the stipple and tile computation to
the GPU will help reduce the amount of code necessary to fill
rectangles and spans, along with improving performance, assuming the
above exercise generalizes to other primitives.
Getting and Testing the Code
Most of the changes here are from Eric's glamor-server branch:
git://people.freedesktop.org/~anholt/xserver glamor-server
The two patches shown above, along with a pair of simple clean up
patches that I've written this week, are available here:
git://people.freedesktop.org/~keithp/xserver glamor-server
Of course, as this now uses libepoxy, you'll need to fetch, build and
install that before trying to compile this X server.
Because you can try all of this out in Xephyr, it's easy to download
and build this X server and then run it right on top of your current
driver stack inside of X. I'd really like to hear from people with
Radeon or nVidia hardware to know whether the code works, and how it
compares with fb on the same machine, which you get when you elide the
-glamor argument from the example Xephyr command line above.